from IPython.display import HTML, display, Image
HTML('''
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script><script>code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

Term 3 LT9 | Christine Albao, William Delfin, Felicismo Lazaro III, Ian Lucas, Loraine Menorca
With the rise of social media in recent years, different industries have started to take notice and explore how they can harness the power of these platforms to improve and expand their businesses [10]. One of these industries is marketing, which has undergone a revolution with the rise of social media. Individuals no longer need to spend large amounts of money on advertising to get their products or services known [9]. Facebook, TikTok, Twitter, and Instagram are a few examples of social media platforms now used by different businesses and industries in their marketing operations.
Alongside the social media revolution in marketing came the rise of influencers and influencer marketing. Influencers are social media personalities or entities known to be knowledgeable or entertaining about a certain topic. They grow their communities by gaining followers on social media platforms. With a large enough following, they gain the power to shape buying habits and decisions, as well as product trends, within their communities. In the Philippines, the effectiveness of influencer marketing is compounded by the fact that the country is the second most active worldwide in terms of social media consumption.
In this project, we explored how the power of clustering and recommender systems can be used to suggest influencers that a brand can tap for potential partnerships and advertising deals on Twitter. As a specific example, we used the brand Dove and its Twitter account as a brand in need of influencer recommendations. We then built a Twitter database by scraping the tweets of Dove, its followers, and other major Filipino influencers such as Aldub, to be used for the clustering and recommender systems.
After standard data cleaning, exploration, and preprocessing, we performed clustering on the data points and identified three main clusters: News and Entertainment Outlets, Micro and Macro Influencers, and Celebrities. For the recommender system, we used a content-based approach, which requires the creation of item and user profiles. For the item profiles, we used the accounts as rows and their tweet keywords as features; the user profiles, in turn, were aggregated using the number of followers and tweets. The clustering results and the recommendations identified key personalities such as Tirso Cruz and Liza Soberano, to name a few, as ideal candidates for Dove to partner with. These personalities also align with the values that Dove stands for, such as empowerment and self-love.
Table 1. Influencer Profile (df_partners) - Data Dictionary
Table 2. Influencer Tweets (df_partner_tweets) - Data Dictionary
Figure 1. Data Source Overview
Figure 2. Distribution of Profile Metrics
Figure 3. Distribution of Followers and Tweet Count
Figure 4. Distribution of Tweet Metrics
Figure 5. Distribution of Account-Following count […]
Figure 6. The Project Methodology
Figure 7. The Model Pipeline
Figure 8. Clustering Internal Validation Criteria
Figure 9. Final k-medoids clustering retaining only 3 clusters
Figure 10. Cluster 1: News/Entertainment Outlets, 54 Twitter accounts
Figure 11. Cluster 2: Celebrities, 194 Twitter accounts
Figure 12. Cluster 3: Social Media Macro & Micro Influencers, 659 Twitter accounts
Figure 13. Tweets of Recommended News/Entertainment Outlets
Figure 14. Tweets of Recommended Celebrities
Figure 15. Tweets of Recommended Social Media Micro- & Macro-Influencers
Figure 16. Recommender System Performance
The power of social media is self-evident: it builds relationships, shares experiences, and even educates people at scale. In marketing, a social media presence has become essential, allowing brands to reach people and potential clients in ways that were impossible before social media. While brands and companies can run ads on various platforms through their business accounts, another arguably more effective and powerful avenue is influencer marketing. According to the Influencer Marketing Hub: "Influencer marketing involves a brand collaborating with an online influencer to market one of its products or services." [11]
The effectiveness of influencer marketing is no longer up for debate. However, choosing whom to partner with can be an involved and relatively expensive process: companies must consider not only an influencer's reach but, more importantly, the personality's values and how they align with their own. A typical way for brands to reach influencers is through ad agencies. Agencies usually have a pool of influencers they can tap to partner with clients, and while this can greatly hasten the shortlisting process, agency fees are significant. High agency fees are why some brands choose to contact influencers directly instead; the obvious downside is the time and effort involved in scouting and communicating with potential partners.
Machine learning and information retrieval techniques can alleviate the tedious process of choosing whom to partner with and significantly improve the quality of the outcome. Using a brand's current social network, together with the aggregated networks of the most influential personalities, can yield a list of potential celebrity and influencer (micro and macro) partners, greatly streamlining the process and likely cutting expenses.
The necessary details of the scraping process are covered in this section. A separate notebook, lt9_dmw2_finalproject_scraping.ipynb, which lays out the whole process, is also provided for more information.
Figure 1. Data Source Overview.
The profiles of the brand and influencers were scraped via Twitter API. Similarly, the tweets of the influencers were collected.
import numpy as np
import pandas as pd
import sqlite3
# Plotting tools
import seaborn as sns
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objects as go
import plotly.express as px
# NLP tools
from collections import Counter
from tqdm import tqdm, trange
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
from IPython.display import display, HTML
# Clustering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA, TruncatedSVD
from scipy.spatial.distance import euclidean, cityblock
from sklearn.base import clone
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.neighbors import NearestNeighbors
from pyclustering.cluster.kmedoids import kmedoids
from scipy.cluster.hierarchy import dendrogram, fcluster
from sklearn.metrics import (calinski_harabasz_score,
silhouette_score,
davies_bouldin_score)
# Recommendation System
from sklearn.metrics import dcg_score, ndcg_score
from scipy.spatial.distance import cosine
np.random.seed(143)
randstate = 143
# set global plotting parameters
custom_sns_params = {'lines.linewidth': 2, 'font.size': 12,
'axes.titlesize': 14, 'axes.labelsize': 12,
'xtick.labelsize': 12, 'ytick.labelsize': 12,
'legend.fontsize': 12, 'legend.fancybox': True}
sns.set_theme('notebook', style='ticks', rc=custom_sns_params)
colors = ['#003B7F', '#EDC254']
custom_palette = sns.blend_palette(colors) # , n_colors=5
sns.set_palette(custom_palette)
# define a global parameter figure counter
fig_n = 2
def fig_count():
global fig_n
fig_n += 1
return fig_n
Influencers
We considered the users that an account follows to be its social network, and defined influencers as users who have **at least 50,000** followers. In this work, 3 social networks were considered as the pool of potential partners that a brand can be matched with: Anne Curtis's, Alden Richards's, and the brand's own. Overall, 907 influencers were collected.
Since Anne Curtis and Alden Richards were among the top 10 most followed Twitter users in the Philippines as of 2016 [1], it was assumed that their social networks could represent a good sample of influencers in the country. The brand's own social circle was also added to the pool of potential partners, on the assumption that it represents influencers the brand already considers a good match.
The `https://api.twitter.com/2/users/:id/following` endpoint was used to get the 3 social networks that make up the pool of 907 influencers. The query returns the profile of each user, the contents of which are briefly described in Table 1.
| Feature | Data Type | Description |
|---|---|---|
| id | string | unique identifier for each twitter user |
| description | string | twitter bio description |
| created_at | datetime | date of account creation |
| username | string | twitter user username |
| protected | integer | indicates if twitter account is protected or not |
| name | string | real or formal name of twitter user |
| url | string | url link of twitter account |
| location | string | address of twitter user |
| followers_count | integer | indicates the number of followers the user has |
| tweet_count | integer | indicates the number of tweets the user has posted |
| listed_count | integer | indicates the number of lists the user is in |
| included | integer | indicates if the twitter user is included in the consideration set |
| rating | integer | indicates the rating of twitter user |
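For concreteness, the paginated call to the `following` endpoint described above can be sketched as follows. The user id, field list, and pagination token here are illustrative placeholders rather than this project's actual values; the full scraping code is in lt9_dmw2_finalproject_scraping.ipynb.

```python
# Sketch: assemble one page of a Twitter v2 "following" request.
# The user id and field list below are placeholders for illustration.

def build_following_request(user_id, max_results=1000, pagination_token=None):
    """Return the URL and query parameters for one page of a user's following list."""
    url = f"https://api.twitter.com/2/users/{user_id}/following"
    params = {
        "max_results": max_results,  # the API allows up to 1000 users per page
        "user.fields": "description,created_at,public_metrics,protected,location,url",
    }
    if pagination_token is not None:
        params["pagination_token"] = pagination_token  # resume from the previous page
    return url, params

url, params = build_following_request("123456")
# The page would then be fetched with e.g.
# requests.get(url, params=params, headers={"Authorization": f"Bearer {BEARER_TOKEN}"})
```

Paging continues by passing the `next_token` from each response back in as `pagination_token` until it is absent.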
Tweets
The 100 most recent tweets of each influencer as of 05 March 2023 were collected. The content of each influencer's tweets was assumed to reflect their ideals, values, and tone, and was used as the basis for evaluating whether an influencer is a good match with the brand. Overall, 88,135 tweets were gathered.
The `https://api.twitter.com/2/users/:id/tweets` endpoint was used to get the tweets of each influencer. The contents of the scraped tweets are briefly described in Table 2.
| Feature | Data Type | Description |
|---|---|---|
| lang | string | indicates the language used for the tweet |
| id | integer | unique identifier of the tweet |
| created_at | datetime | date of creation of the tweet |
| possibly_sensitive | integer | indicates if topic of tweet is sensitive or not |
| author_id | integer | unique identifier for the author of the tweet |
| conversation_id | integer | unique identifier for the specific tweet and its replies |
| text | string | content of the tweet |
| in_reply_to_user_id | integer | id of the user the tweet is replying to, if any |
| retweet_count | integer | indicates the number of times the tweet was shared |
| reply_count | integer | indicates the number of replies the tweet has received |
| like_count | integer | indicates the number of likes the tweet has received |
| quote_count | integer | indicates the number of times the tweet was quoted by another user |
| impression_count | integer | indicates the number of impressions the tweet has garnered |
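A matching sketch for the tweets endpoint, with the `tweet.fields` list mirroring the columns of Table 2 (again, placeholders only, not the project's exact request):

```python
# Sketch: query parameters for one page of a user's most recent tweets.
# The user id is a placeholder; the field list mirrors Table 2.

def build_tweets_request(user_id, max_results=100):
    """Return the URL and query parameters for a user's most recent tweets."""
    url = f"https://api.twitter.com/2/users/{user_id}/tweets"
    params = {
        "max_results": max_results,  # the API caps this at 100 tweets per request
        "tweet.fields": ("lang,created_at,possibly_sensitive,author_id,"
                         "conversation_id,in_reply_to_user_id,public_metrics"),
    }
    return url, params
```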
# Main database
sqlite_db = 'dmw2_final_project.db'
conn = sqlite3.connect(sqlite_db)
# Load necessary tables
tbl_partners = 'partners'
df_partners = pd.read_sql(f"SELECT * FROM {tbl_partners}",
parse_dates=['created_at'], con=conn)
tbl_partner_tweets = 'partner_tweets'
df_partner_tweets = pd.read_sql(f"SELECT * FROM {tbl_partner_tweets}",
parse_dates=['created_at'], con=conn)
df_partners.head()
| id | description | created_at | username | protected | name | url | location | followers_count | following_count | tweet_count | listed_count | included | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49616273 | China's national English language newspaper, u... | 2009-06-22 12:41:39+00:00 | globaltimesnews | 0 | Global Times | https://t.co/LgROMWT42V | Beijing, China | 1880407 | 538 | 229828 | 0 | 1.0 | 1.0 |
| 1 | 293932241 | Always Thankful. ask@vmgasia.co | 2011-05-06 07:09:52+00:00 | AlyssaValdez2 | 0 | Alyssa Valdez | https://t.co/NuWOu0Mt66 | None | 2258385 | 608 | 5304 | 445 | 1.0 | 1.0 |
| 2 | 42335426 | Filipina Wife, Mother of 5, Homemaker, Actress... | 2009-05-25 02:48:59+00:00 | mommymaricel | 1 | Maricel Laxa-P. | http://t.co/T8kqyg478t | Manila, Philippines | 69898 | 109 | 6333 | 104 | 1.0 | 1.0 |
| 3 | 333955253 | Writer, moon child, cat mom, fangirl and dream... | 2011-07-12 10:38:32+00:00 | iamAlyloony | 0 | Aly 🌑🌸 | https://t.co/Y27e5bUvnm | PH | 340964 | 178 | 136503 | 186 | 1.0 | 1.0 |
| 4 | 58155585 | Curiouser and curiouser. | 2009-07-19 08:20:45+00:00 | KianaVee | 0 | Kiana V | https://t.co/1MMICAW9u5 | None | 90953 | 844 | 20016 | 72 | 1.0 | 1.0 |
df_partner_tweets.head()
| lang | id | created_at | possibly_sensitive | author_id | conversation_id | text | in_reply_to_user_id | retweet_count | reply_count | like_count | quote_count | impression_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | en | 1632275246328823808 | 2023-03-05 07:01:59+00:00 | 0 | 49616273 | 1632275246328823808 | China's deficit-to-GDP ratio is set at 3 perce... | None | 0 | 0 | 0 | 0 | 55 |
| 1 | en | 1632274933089808386 | 2023-03-05 07:00:44+00:00 | 0 | 49616273 | 1632274933089808386 | Fifteen national advisors issued a joint propo... | None | 0 | 0 | 1 | 0 | 259 |
| 2 | en | 1632267673399701504 | 2023-03-05 06:31:54+00:00 | 0 | 49616273 | 1632267673399701504 | The number of giant pandas at SW China's Cheng... | None | 1 | 0 | 12 | 0 | 2478 |
| 3 | en | 1632258378243391488 | 2023-03-05 05:54:57+00:00 | 0 | 49616273 | 1632258378243391488 | China’s major public hospitals should increase... | None | 3 | 2 | 7 | 0 | 3139 |
| 4 | en | 1632245705749450753 | 2023-03-05 05:04:36+00:00 | 0 | 49616273 | 1632245705749450753 | "The Wandering Earth 2 let the audience see a ... | None | 7 | 1 | 9 | 1 | 3662 |
This work considers Dove as the stakeholder who aims to find social media influencers that best match their brand image and values. Moving forward, "Dove" and the "Brand" may be used interchangeably.
Dove's social network was obtained to determine which users are already "liked" by the brand. The tweets of Dove's social network were used as a basis for the brand's tone, values, and ideals that would be compared with that of the other potential partners in the matching process.
df_partners.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 915 entries, 0 to 914
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               915 non-null    object
 1   description      915 non-null    object
 2   created_at       915 non-null    datetime64[ns, UTC]
 3   username         915 non-null    object
 4   protected        915 non-null    int64
 5   name             915 non-null    object
 6   url              754 non-null    object
 7   location         687 non-null    object
 8   followers_count  915 non-null    int64
 9   following_count  915 non-null    int64
 10  tweet_count      915 non-null    int64
 11  listed_count     915 non-null    int64
 12  included         915 non-null    float64
 13  rating           915 non-null    float64
dtypes: datetime64[ns, UTC](1), float64(2), int64(5), object(6)
memory usage: 100.2+ KB
df_partners.describe()
| protected | followers_count | following_count | tweet_count | listed_count | included | rating | |
|---|---|---|---|---|---|---|---|
| count | 915.000000 | 9.150000e+02 | 915.000000 | 9.150000e+02 | 915.000000 | 915.0 | 915.000000 |
| mean | 0.007650 | 3.934655e+06 | 6184.685246 | 5.919677e+04 | 9202.324590 | 1.0 | 0.064481 |
| std | 0.087178 | 1.147208e+07 | 50445.232292 | 1.253672e+05 | 28201.544379 | 0.0 | 0.245742 |
| min | 0.000000 | 5.021600e+04 | 0.000000 | 8.000000e+00 | 0.000000 | 1.0 | 0.000000 |
| 25% | 0.000000 | 1.817210e+05 | 195.000000 | 8.539500e+03 | 377.500000 | 1.0 | 0.000000 |
| 50% | 0.000000 | 6.242060e+05 | 513.000000 | 2.013600e+04 | 1356.000000 | 1.0 | 0.000000 |
| 75% | 0.000000 | 2.431820e+06 | 1037.500000 | 4.703600e+04 | 5994.000000 | 1.0 | 0.000000 |
| max | 1.000000 | 1.134631e+08 | 853552.000000 | 1.149782e+06 | 534074.000000 | 1.0 | 1.000000 |
df_partners.hist(figsize=(20,10))
plt.suptitle(f'Fig. {fig_n}: Distribution of Profile Metrics', fontsize=16);
_ = fig_count()
plt.rcParams['figure.figsize'] = (12, 5)
plt.subplot(1, 2, 1)
sns.histplot(df_partners['followers_count'], kde=True)
plt.subplot(1, 2, 2)
sns.histplot(df_partners['tweet_count'], kde=True)
plt.suptitle(f'Fig. {fig_n}: Distribution of Followers and Tweet Count',
fontsize=16)
plt.show();
_ = fig_count()
df_partner_tweets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88135 entries, 0 to 88134
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   lang                 88135 non-null  object
 1   id                   88135 non-null  object
 2   created_at           88135 non-null  datetime64[ns, UTC]
 3   possibly_sensitive   88135 non-null  int64
 4   author_id            88135 non-null  object
 5   conversation_id      88135 non-null  object
 6   text                 88135 non-null  object
 7   in_reply_to_user_id  16544 non-null  object
 8   retweet_count        88135 non-null  int64
 9   reply_count          88135 non-null  int64
 10  like_count           88135 non-null  int64
 11  quote_count          88135 non-null  int64
 12  impression_count     88135 non-null  int64
dtypes: datetime64[ns, UTC](1), int64(6), object(6)
memory usage: 8.7+ MB
df_partner_tweets.describe()
| possibly_sensitive | retweet_count | reply_count | like_count | quote_count | impression_count | |
|---|---|---|---|---|---|---|
| count | 88135.00000 | 88135.000000 | 88135.000000 | 8.813500e+04 | 88135.000000 | 8.813500e+04 |
| mean | 0.00733 | 430.759403 | 87.710773 | 2.827568e+03 | 86.941567 | 4.789302e+04 |
| std | 0.08530 | 4512.823739 | 924.958186 | 2.460507e+04 | 1585.003811 | 6.603273e+05 |
| min | 0.00000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000e+00 |
| 25% | 0.00000 | 0.000000 | 0.000000 | 4.000000e+00 | 0.000000 | 0.000000e+00 |
| 50% | 0.00000 | 4.000000 | 1.000000 | 3.100000e+01 | 0.000000 | 0.000000e+00 |
| 75% | 0.00000 | 33.000000 | 9.000000 | 2.340000e+02 | 4.000000 | 5.474500e+03 |
| max | 1.00000 | 349510.000000 | 74964.000000 | 1.886537e+06 | 163748.000000 | 1.110853e+08 |
df_partner_tweets.hist(figsize=(20,10))
plt.suptitle(f'Fig. {fig_n}: Distribution of Tweet Metrics',
fontsize=16);
_ = fig_count()
df_partner_tweets.retweet_count.value_counts()
0 24629
1 8350
2 5346
3 3740
4 2905
...
6054 1
11281 1
7666 1
2250 1
2006 1
Name: retweet_count, Length: 4075, dtype: int64
r_cnt = df_partners['rating'].value_counts()
print(f' No Followback = {r_cnt[0]} \
\n Follow back = {r_cnt[1]}')
df_partners['rating'].value_counts().plot(kind='bar')
plt.title(f'Fig. {fig_n}: Distribution of Account-Following count.'
'\n1 indicates a Follow-back, 0 otherwise.', fontsize=16);
_ = fig_count()
No Followback = 856 Follow back = 59
Figure 6. The Project Methodology.
The influencers' profiles and tweets were scraped using the Twitter API. The profile description (i.e., bio) and tweets of each were subjected to text pre-processing techniques before undergoing clustering and the recommendation system. Similarly, the profile of the brand (Dove) was obtained.
_ = fig_count()
The tweets were first cleaned by removing links, usernames, and unnecessary characters such as double spaces or stray punctuation, as these were of no use within our research scope.
Bag-of-Words Representation
After cleaning, the tweets were represented as bag-of-words vectors [2], where each component corresponds to a unique word (token or term) and its value is the number of times the word occurs in the text.
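As a toy illustration (not project data), a cleaned tweet maps to a bag-of-words representation like this:

```python
from collections import Counter

# A bag-of-words representation counts each token's occurrences,
# discarding word order entirely.
tweet = "love your skin love yourself"
bow = Counter(tweet.split())
print(bow)  # Counter({'love': 2, 'your': 1, 'skin': 1, 'yourself': 1})
```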
Lemmatization
The bag-of-words tokens were then lemmatized to reduce raw words to their lemmas. For this, the `WordNetLemmatizer` class from the `nltk.stem` module was applied.
In linguistics, lemmatization refers to grouping together the inflected forms of a word and converting them to their lemma, or dictionary form [3]. Compared with stemming, another text-processing technique that truncates words to their root form, lemmatization is more accurate because it algorithmically determines the lemma based on the word's intended meaning [4, 5].
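The difference matters in practice: a stemmer clips suffixes mechanically, while a lemmatizer resolves each word through a dictionary. A toy contrast (the hand-built lookup table below stands in for WordNet; the project itself uses nltk's `WordNetLemmatizer`, which requires the `wordnet` corpus):

```python
# Toy contrast between crude suffix-stripping ("stemming") and a
# dictionary lookup ("lemmatization"). The lemma table is hand-built
# for illustration only.
toy_lemmas = {"studies": "study", "better": "good", "was": "be"}

def toy_stem(word):
    # Naive stemmer: chop a common suffix with no dictionary knowledge.
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Lemmatizer: map the word to its known dictionary form, if any.
    return toy_lemmas.get(word, word)

print(toy_stem("studies"))       # 'stud'  -- a mangled non-word root
print(toy_lemmatize("studies"))  # 'study' -- a valid dictionary form
```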
Term Frequency-Inverse Document Frequency (TF-IDF)
Given the large vocabulary remaining after lemmatization, the `TfidfVectorizer` from the `sklearn.feature_extraction.text` module was used to determine which words are relevant to a text and to penalize words that occur across many documents.
TF-IDF measures the relevance of each word in a document: the more frequently it appears in the document, the more relevant it is. However, the word's rarity across the corpus is also taken into account, to ensure that a word scores highly because it is distinctive, not merely because it is common [6, 7].
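A small worked example, using the classic tf × log(N/df) weighting (sklearn's `TfidfVectorizer` uses a smoothed, normalized variant, but the intuition is the same): a term that appears in every document gets a weight of zero.

```python
import math

# Tiny corpus of three "documents" (token lists standing in for cleaned tweets).
docs = [["dove", "soap", "love"],
        ["dove", "love", "love"],
        ["dove", "news"]]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)    # term frequency within the document
    df = sum(term in d for d in docs)  # number of documents containing the term
    return tf * math.log(N / df)       # classic (unsmoothed) IDF

# "dove" appears in all 3 documents: idf = log(3/3) = 0, so its weight vanishes.
print(tfidf("dove", docs[0]))  # 0.0
# "love" is frequent in doc 2 but absent from doc 3, so it keeps a nonzero weight.
print(round(tfidf("love", docs[1]), 3))
```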
def remove_links(tweet):
# Reference: https://ourcodingclub.github.io/tutorials/topic-modelling-python/
"""
Remove web links from a given text
Parameters
----------
tweet : string
Text to be stripped of web links
Returns
-------
tweet : string
Text without webs links
"""
tweet = re.sub(r'http\S+', '', tweet) # remove http links
tweet = re.sub(r'bit.ly/\S+', '', tweet) # remove bitly links
return tweet
def remove_users(tweet):
# Reference: https://ourcodingclub.github.io/tutorials/topic-modelling-python/
"""
Remove user account information and retweet tags from a given text
Parameters
----------
tweet : string
Text to be stripped of web links
Returns
-------
tweet : string
Text user information and retweet tags
"""
tweet = re.sub('(RT\s@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet) # remove retweet
tweet = re.sub('(@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet) # remove tweeted at
return tweet
def clean_tweet(tweet):
# Reference: https://ourcodingclub.github.io/tutorials/topic-modelling-python/
"""
Transform a given text into lowercase format, remove any user account
information, retweet tags, and special characters.
Parameters
----------
tweet : string
Text to be cleaned and formatted
Returns
-------
tweet : string
Cleaned text
"""
tweet = remove_users(tweet)
tweet = remove_links(tweet)
tweet = re.sub(r'[^a-zA-Z]', ' ', tweet.lower()) # lowercase letters
tweet = re.sub(fr'[{punctuations}]+', ' ', tweet) # strip punctuation
tweet = re.sub('\s+', ' ', tweet) #remove double spacing
tweet_tokens = tweet.split(' ') #regex_tokenizer.tokenize(tweet) #
tweet_tokens = [WordNetLemmatizer().lemmatize(w) for w in tweet_tokens
if w not in stop_words]
tweet = ' '.join(tweet_tokens)
return tweet
def text_remove_unicode(text):
text = re.sub(
r"(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?",
"", text)
return text
class Lemmatizer:
"""Lemmatize text using WordNet"""
def __init__(self):
self.wnl = WordNetLemmatizer()
def __call__(self, text):
return [
self.wnl.lemmatize(word)
for word
in re.findall(r"(?u)(?<!<)[a-z]{2,}", text)
]
# define tagalog stop words and bad words [R]
tl_stopwords= ["pldt","pldtcares","pldthome","akin","aking","ako","alin",
"am","amin","aming","ang","ano","anumang","apat","at","atin",
"ating","ay","bababa","bago","bakit","bawat","bilang","dahil",
"dalawa","dapat","din","dito","doon","gagawin","gayunman",
"ginagawa","ginawa","ginawang","gumawa","gusto","habang",
"hanggang","hindi","huwag","iba","ibaba","ibabaw","ibig",
"ikaw","ilagay","ilalim","ilan","inyong","isa","isang",
"itaas","ito","iyo","iyon","iyong","ka","kahit","kailangan",
"kailanman","kami","kanila","kanilang","kanino","kanya",
"kanyang","kapag","kapwa","karamihan","katiyakan","katulad",
"kaya","kaysa","ko","kong","kulang","kumuha","kung","laban",
"lahat","lamang","likod","lima","maaari","maaaring","maging",
"mahusay","makita","marami","marapat","masyado","may",
"mayroon","mga","minsan","mismo","mula","muli","na",
"nabanggit","naging","nagkaroon","nais","nakita","namin",
"napaka","narito","nasaan","ng","ngayon","ni","nila","nilang",
"nito","niya","niyang","noon","o","pa","paano","pababa",
"paggawa","pagitan","pagkakaroon","pagkatapos","palabas",
"pamamagitan","panahon","pangalawa","para","paraan","pareho",
"pataas","pero","pumunta","pumupunta","sa","saan","sabi",
"sabihin","sarili","sila","sino","siya","tatlo","tayo",
"tulad","tungkol","una","walang", "nyo", "niyo", "naman",
"mo", "pls", "po", "kayo", "ba", "hi", "hello", "wala", "u",
"nung", "nang", "kami", "kmi", "amp", "beh", "rin", "din",
"jusko", "ha", "g", "kasi", "lang", "pi", "nadin", "narin",
"e", "eh", "nga", "hey", "huy", "kayong", "nag", "paki", "pls"]
tl_badwords = ["amputa","animal ka","bilat","binibrocha","bobo","bogo",
"boto","brocha","burat","bwesit","bwisit","demonyo ka",
"engot","etits","gaga","gagi","gago","habal","hayop ka",
"hayup","hinampak","hinayupak","hindot","hindutan","hudas",
"iniyot","inutel","inutil","iyot","kagaguhan","kagang",
"kantot","kantotan","kantut","kantutan","kaululan","kayat",
"kiki","kikinginamo","kingina","kupal","leche","leching",
"lechugas","lintik","nakakaburat","nimal","ogag","olok",
"pakingshet","pakshet","pakyu","pesteng yawa","poke","poki",
"pokpok","poyet","pu'keng","pucha","puchanggala","puchangina",
"puke","puki","pukinangina","puking","punyeta","puta","putang",
"putang ina","putangina","putanginamo","putaragis","putragis",
"puyet","ratbu","shunga","sira ulo","siraulo","suso","susu",
"tae","taena","tamod","tanga","tangina","taragis","tarantado",
"tete","teti","timang","tinil","tite","titi","tungaw","ulol",
"ulul","ungas", "yawa"]
stop_words = stopwords.words('english') + tl_stopwords + tl_badwords
punctuations = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'
exclude_words = stopwords.words(
'english') + tl_stopwords + tl_badwords + ['twitter', 'account', 'official', 'new']
def plot_wordcloud(data_dict, title):
"""
Plot a word cloud of a set of input words
Parameters
----------
data_dict : dict
Dictionary whose keys are the words, and their frequency as values
title: str
Title of the figure to plot
"""
c = Counter(data_dict)
res = {key: val for key, val in sorted(c.items(), key = lambda ele: ele[1], reverse=True)}
mask_img = np.array(Image.open('dove_img.png'))
wordcloud = (WordCloud(background_color ='white', colormap='Blues_r', #'gist_heat'
width=1500, height=800, mask=mask_img,
collocations=False, random_state=randstate)
.generate_from_frequencies(res))
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title(f'Fig. {fig_n}: {title}', fontsize=15)
# plt.savefig(f'wc_{title}.png', dpi=150, bbox_inches='tight')
_ = fig_count()
plt.show()
Figure 7. The Model Pipeline.
The influencers were clustered based on their profile descriptions. For each cluster, the item profiles were used along with Dove's ratings to create a user profile, which then served as the basis for the content-based recommendations per cluster. Afterwards, the performance of the system was evaluated.
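A minimal numpy sketch of this content-based step (toy profiles, not project data): item profiles are keyword-weight rows, the user profile is the rating-weighted average of the items the brand "likes", and candidates are ranked by cosine similarity to that profile.

```python
import numpy as np

# Toy item profiles: rows = accounts, columns = tweet-keyword weights.
items = np.array([[0.9, 0.1, 0.0],   # account A
                  [0.8, 0.2, 0.1],   # account B
                  [0.0, 0.1, 0.9]])  # account C
ratings = np.array([1.0, 1.0, 0.0])  # the brand "likes" A and B

# User profile: rating-weighted average of the liked item profiles.
user = ratings @ items / ratings.sum()

def cosine_sim(u, v):
    # Cosine similarity = dot product of the vectors over their norms.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Rank all accounts by similarity to the brand profile, most similar first.
scores = [cosine_sim(user, item) for item in items]
ranking = np.argsort(scores)[::-1]
print(ranking)  # accounts closest to the brand's profile come first
```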
_ = fig_count()
When recommending influencers, it is essential to consider the specific characteristics and interests of Dove's target audience. By clustering the Twitter users prior to employing the recommender system, we can identify groups of influencers with similar demographic characteristics and interests. This allows us to provide specialized recommendations for each cluster, tailored to Dove's specific needs and preferences.
In identifying the natural clustering of influencers, features such as their activity, location, follower ratio, tenure, and keywords used in their bios were considered. The Twitter bio descriptions were reduced to keywords through lemmatization and TF-IDF.
def df_for_clustering(df, to_drop=True):
"""
Function that preprocesses features necessary for clustering
"""
def filter_char(c): return ord(c) < 256
data = df.copy()
data['description'] = (data['description'].str.lower()
.apply(lambda s: ''.join(filter(filter_char, s)))
.apply(text_remove_unicode)
.apply(lambda x: " ".join([re.sub('[^A-Za-z]+',
'', x)
for x in
nltk.word_tokenize(
x)
]))
.apply(lambda x: re.sub(' +', ' ', x))
.apply(lambda x: " ".join([x for x in x.split()
if x not in
exclude_words
]))
)
data['tenure'] = 2023 - pd.to_datetime(data.created_at).dt.year
data['has_location'] = np.where(data.location.isna(), 0, 1)
data['follower_ratio'] = data.followers_count/data.following_count
data['follower_ratio'] = (data.follower_ratio.replace(np.inf, np.nan)
.fillna(data.followers_count))
tfidfvectorizer = TfidfVectorizer(
token_pattern=None,
tokenizer=Lemmatizer(),
stop_words=stop_words+exclude_words+["u", "im", "dont"],
max_df=0.7,
min_df=0.01
).fit(data.description)
data_descrip = pd.DataFrame(
tfidfvectorizer.transform(data.description).todense(),
columns=tfidfvectorizer.get_feature_names_out(),
index=range(1, len(data.description)+1)
).reset_index(drop=True)
final = (pd.concat([data,
data_descrip], axis=1)
.set_index(['id', 'username', 'name']))
if to_drop:
final.drop(columns=['followers_count', 'following_count',
'description', 'created_at', 'location',
'protected'],
inplace=True)
final[['tweet_count',
'listed_count',
'tenure',
'follower_ratio']] = (StandardScaler().fit_transform(
final[['tweet_count', 'listed_count',
'tenure', 'follower_ratio']])
)
return final
# Read the table that contains the features for clustering, and generate
df_ = (pd.read_sql("""SELECT * FROM partners""", conn)
.query('protected==0', engine='python')
.reset_index(drop=True)
.drop(columns=['url', 'included'])
)
df_clean = df_for_clustering(df_)
The keyword extraction from the Twitter bios resulted in a high-dimensional, sparse matrix. For this reason, dimensionality reduction via truncated SVD (t-SVD) was performed prior to feeding the data into the clustering algorithm.
def truncated_svd(X):
"""
Function that accepts the design matrix and returns Q, S, P, and the
normalized squared singular values (explained variance ratios)
"""
q, s, p = np.linalg.svd(X, full_matrices=False)
Q = q
S = np.diag(s)
P = p.T
NSSD = (s / np.sqrt(np.sum(s**2)))**2
return Q, S, P, NSSD
def min_svs(df):
"""
Function to get the minimum number of singular vectors that explain
at least 75% of the variance
"""
q, s, p, nssd = truncated_svd(df)
nssd_cumsum = nssd.cumsum()
return np.argwhere(nssd_cumsum >= 0.75)[0][0]+1
# Dimension reduction process
q_, s_, p_, nssd_ = truncated_svd(df_clean)
svd_ = TruncatedSVD(n_components=min_svs(df_clean),
random_state=1337,
algorithm='arpack')
df_svd = svd_.fit_transform(df_clean.astype(float))
In this study, we used the $k$-medoids clustering algorithm since it is more robust to outliers, can handle a mix of numeric and categorical data, and chooses actual data points as centers in each iteration. Other clustering methods were also explored, such as agglomerative clustering, where the resulting clusters were similar to those of $k$-medoids, and density-based algorithms, where only one cluster was identified. Implementations of these algorithms can be found in the supplementary notebook `other_clustering_methods.ipynb`.
def cluster_range_kmedoids(X, k_start, k_stop, actual=None):
"""
Function that accepts the design matrix, the initial and final values to
step through, and, optionally, actual labels. It returns a dictionary of
the cluster labels, cluster centers, internal validation values and,
if actual labels are given, external validation values, for every 𝑘
"""
ys = []
cs = []
inertias = []
chs = []
scs = []
dbs = []
X = np.asarray(X)
for k in trange(k_start, k_stop+1):
clusterer_k = kmedoids(X, np.arange(k), ccore=True)
clusterer_k.process()
clusters = clusterer_k.get_clusters()
y_pred = np.zeros(len(X), dtype=int)
for cluster, point in enumerate(clusters):
y_pred[point] = cluster
centers = X[clusterer_k.get_medoids()]
ys.append(y_pred)
cs.append(centers)
res_dict = dict(zip(['ys', 'centers'], [ys, cs]))
# internal validation metrics
sse = np.sum([euclidean(x, c) ** 2 for i, c
in enumerate(centers) for x in X[y_pred == i]])
inertias.append(sse) # SS to centroids
chs.append(calinski_harabasz_score(X, y_pred)) # Calinski-Harabasz
scs.append(silhouette_score(X, y_pred)) # Silhouette score
dbs.append(davies_bouldin_score(X, y_pred)) # Davies-Bouldin
keys = ['inertias', 'chs', 'scs', 'dbs']
values = [inertias, chs, scs, dbs]
internal_dict = dict(zip(keys, values))
res = {**res_dict, **internal_dict}
return res
# Perform k-medoids clustering using k values from 2 to 10
res_kmedoid = cluster_range_kmedoids(df_svd, 2, 10)
To determine the optimal number of clusters for $k$-medoids, the Silhouette Coefficient (SC), Calinski-Harabasz (CH), and Davies-Bouldin (DB) scores were used. High values of SC and CH and low values of DB are desired. Based on these internal validation metrics, the optimal number of clusters obtained is 4. However, the last cluster contains only one data point/Twitter user (i.e., Taylor Swift); hence, we dropped it and retained only three final clusters.
def plot_internal(chs, scs, dbs):
"""Plot internal validation values"""
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
ks = np.arange(2, len(chs)+2)
axes[0].plot(ks, chs, '-ro', label='CH')
axes[0].set_xlabel('$k$')
axes[0].set_ylabel('CH')
axes[1].plot(ks, scs, '-ko', label='Silhouette coefficient')
axes[1].set_xlabel('$k$')
axes[1].set_ylabel('Silhouette')
axes[2].plot(ks, dbs, '-gs', label='DB')
axes[2].set_xlabel('$k$')
axes[2].set_ylabel('DB')
plt.subplots_adjust(wspace=0.4)
plt.title(f'Fig. {fig_n}: Clustering Internal Validation Criteria', fontsize=16)
_ = fig_count()
return axes
def plot_clusters(X, ys, centers, transformer):
"""Plot clusters given the design matrix and cluster labels"""
k_max = len(ys) + 1
k_mid = k_max//2 + 2
fig, ax = plt.subplots(2, k_max//2, dpi=150, sharex=True, sharey=True,
figsize=(7, 4), # subplot_kw=dict(aspect='equal'),
gridspec_kw=dict(wspace=0.01))
for k, y, cs in zip(range(2, k_max+1), ys, centers):
centroids_new = transformer.transform(cs)
if k < k_mid:
ax[0][k % k_mid-2].scatter(*zip(*X), c=y, s=1, alpha=0.8)
ax[0][k % k_mid-2].scatter(
centroids_new[:, 0],
centroids_new[:, 1],
s=10,
c=range(int(max(y)) + 1),
marker='s',
ec='k',
lw=1
)
ax[0][k % k_mid-2].set_title('$k=%d$' % k)
else:
ax[1][k % k_mid].scatter(*zip(*X), c=y, s=1, alpha=0.8)
ax[1][k % k_mid].scatter(
centroids_new[:, 0],
centroids_new[:, 1],
s=10,
c=range(int(max(y))+1),
marker='s',
ec='k',
lw=1
)
ax[1][k % k_mid].set_title('$k=%d$' % k)
fig.suptitle(f'Fig. {fig_n}: $k$-medoids clusters', fontsize=16)
_ = fig_count()
return ax
def plot_describe_cluster(df, cluster):
"""
Function that outputs analysis per cluster via wordcloud and radar plots
"""
dict_clustname = {1: 'News/Entertainment Outlets', 2: 'Celebrities',
3: 'Social Media Macro & Micro Influencers'}
dict_color_radar = {1: '#feae02', 2: '#888888', 3: '#1c43c9'}
dict_color_cloud = {1: 'Wistia', 2:'gray', 3:'Blues'}
fig, ax = plt.subplots(1, 1, figsize=(10, 6), dpi=100)
# Wordcloud
text = ' '.join([word for word in
set(df[(df.cluster == cluster)]
['description'])
if word not in exclude_words])
mask_img = np.array(Image.open('dove_img.png'))
wordcloud = WordCloud(background_color='white',
collocations=False,
mask=mask_img,
colormap=dict_color_cloud[cluster]).generate(text)
ax.imshow(wordcloud, interpolation="bilinear")
plt.title(
f'Fig. {fig_n}: Cluster {cluster}: {dict_clustname[cluster]}, {df[(df.cluster == cluster)].shape[0]} Twitter accounts')
_ = fig_count()
ax.axis('off')
plt.show()
# Radar Plot
df_med = (df
[['followers_count', 'following_count', 'tweet_count',
'listed_count', 'tenure', 'cluster']]
.groupby('cluster').agg('median')
)
df_med.iloc[:, :] = MinMaxScaler().fit_transform(df_med)
categories = df_med.columns.tolist()
categories = [*categories, categories[0]]
fig = go.Figure()
r_ = df_med.iloc[cluster-1].values[0:].tolist()
r_ = [*r_, r_[0]]
fig.add_trace(go.Scatterpolar(
r=r_,
theta=categories,
fill='toself',
name=str(df_med.index[cluster-1]),
line_color=dict_color_radar[df_med.index[cluster-1]],
opacity=0.7
))
fig.update_layout(template=None, plot_bgcolor="rgba(0,0,0,0)",
paper_bgcolor="rgba(0,0,0,0)",
polar=dict(radialaxis=dict(angle=90,
tick0=1,
dtick=0.5,
range=[-1, 1.45],
tickangle=90,
titlefont={"size": 15, }),
angularaxis=dict(rotation=162,
tickfont={"size": 15})),
showlegend=False)
return fig
plot_internal(res_kmedoid['chs'],
res_kmedoid['scs'], res_kmedoid['dbs']);
20 tweets were randomly sampled per influencer and used to create an item profile for each. These tweets then underwent the text pre-processing pipeline described in Section VII.A. The resulting item profile is a matrix with influencers as rows, the keywords of their tweets as columns, and the term frequency-inverse document frequency (TF-IDF) of the keywords as values.
tweet_count = 20
d_partner_tweets = {}
# Randomly sample 20 tweets per partner
for idx, k in df_partner_tweets.groupby('author_id'):
author = k.author_id.unique()[0]
try:
k_sample = k.sample(n=tweet_count) #, random_state=randstate
d_partner_tweets[author] = ' '.join(k_sample.text)
except ValueError:
d_partner_tweets[author] = ' '.join(k.text)
# Clean the sample tweets
df_partner_tweets_s = (pd.DataFrame({'text': d_partner_tweets})
.reset_index().rename(columns={'index': 'author_id'}))
df_partner_tweets_s['clean_text'] = df_partner_tweets_s.apply(lambda x: clean_tweet(x.text), axis=1)
# Vectorize
tfidfvectorizer = TfidfVectorizer(
token_pattern=r'[a-z-]+',
max_df=0.7,
min_df=0.05
).fit(df_partner_tweets_s.clean_text)
# Create the item profiles of each influencer
df_item_profiles = pd.DataFrame(
tfidfvectorizer.transform(df_partner_tweets_s.clean_text).todense(),
columns=tfidfvectorizer.get_feature_names_out(),
index=range(1, len(df_partner_tweets_s.clean_text)+1)
)
User Rating
In the absence of explicit user ratings, a weighted score was created as a measure of how much Dove "likes" a particular influencer. As shown in equation (\ref{eq:weighted_score}), this weighted score, $w_r$, considers whether the influencer is within the brand's social circle (i.e., whether Dove follows them), along with the influencer's average tweets per year and number of followers.
\begin{equation} w_r = \beta_r \times \left( 0.60 \times f_r + 0.40 \times t_{r, avg} \right) \tag{1} \label{eq:weighted_score} \end{equation} \begin{equation} u_r = \frac{cf_{b, w_r} + 0.5f_{w_r}}{N} \times 100 \tag{2} \label{eq:user_rating} \end{equation} For each influencer $r$, $\beta_r$ is the binary rating given by Dove, equal to 1 if Dove follows them and 0 otherwise, $f_r$ is their number of followers, and $t_{r, avg}$ is their average tweets per year. Note that $w_r$ will be equal to 0 if Dove does not follow the influencer.
The final user rating, $u_r$, is the percentile rank of each influencer based on $w_r$, as shown in equation (\ref{eq:user_rating}), where $cf_{b, w_r}$ is the cumulative frequency of scores below $w_r$, $f_{w_r}$ is the frequency of the score $w_r$, and $N$ is the total number of influencers observed.
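To make equations (1) and (2) concrete, the snippet below evaluates them on three hypothetical followed influencers; the counts and the `percentile_rank` helper are illustrative only and not part of the project's data or code.

```python
import numpy as np

# Toy illustration of equations (1) and (2); the three influencers and
# their counts below are hypothetical.
beta = np.array([1, 1, 1])                         # beta_r: 1 if Dove follows them
followers = np.array([100_000, 5_000, 20_000])     # f_r
avg_tweets_per_year = np.array([1_200, 300, 900])  # t_{r,avg}

# Equation (1): weighted score, 60% followers + 40% average tweets/year
w_r = beta * (0.60 * followers + 0.40 * avg_tweets_per_year)


def percentile_rank(w):
    """Equation (2): (cf_below + 0.5 * f_w) / N * 100, midpoint convention."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    return np.array([(np.sum(w < x) + 0.5 * np.sum(w == x)) / n * 100
                     for x in w])


u_r = percentile_rank(w_r)
```

The notebook itself approximates equation (2) with pandas' `rank(pct=True)`, which uses average ranks rather than the midpoint cumulative-frequency form, so small differences in $u_r$ are expected.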
Aggregated User Profile
The mean-centered user ratings $u_r$ were used as weights on each influencer's item profile. The resulting user profile is then the weighted sum of the item profiles of the rated influencers.
# Add derived variables
from datetime import datetime
df_partners['tenure'] = datetime.now().year - df_partners.created_at.dt.year
df_partners['avg_tweets_per_year'] = df_partners.tweet_count / df_partners.tenure
# Estimate a rating based on followers and tweet count
rated = df_partners.loc[df_partners.rating == 1]
df_partners['w_rating'] = (df_partners.loc[rated.index]
.apply(lambda x: (x.followers_count*0.60)
+ (x.avg_tweets_per_year*0.40),
axis=1))
df_partners['w_rating_pct'] = df_partners.loc[rated.index].w_rating.rank(pct=True, ascending=True)
def compute_user_profile_agg_numeric(df_utility, df_item_profiles):
"""
Return the profile of a given user with numeric ratings.
Parameters
----------
df_utility : pandas DataFrame
Utility matrix
df_item_profiles : pandas DataFrame
Item profiles
Returns
-------
pandas Series
Numeric user profile
"""
weights = (df_utility - df_utility.mean())
return (df_item_profiles * weights.to_numpy()[:, np.newaxis]).mean()
The cosine distance [2] between the item profile of each influencer, $\vec v_1$, and the user profile, $\vec v_2$, given by equation (\ref{eq:cos_dist}), was calculated to determine which influencers have a tone, ideals, and values "similar" to those of the brand, Dove.
\begin{equation} d_{\cos}\left(\vec v_1, \vec v_2\right) = 1 - \frac{\vec v_1 \cdot \vec v_2}{\lVert \vec v_1 \rVert \, \lVert \vec v_2 \rVert} \tag{3} \label{eq:cos_dist} \end{equation}
For each cluster, $L = 40$ influencers were initially chosen for review, then 5 were shortlisted to present to Dove as a recommendation.
def recommend_agg(df_utility, df_item_profiles, user_profile, L):
"""
Return a list of recommended unrated items for a user, sorted
from most recommended to least, then by influencer ID.
Parameters
----------
df_utility : pandas DataFrame
Utility matrix
df_item_profiles : pandas DataFrame
Item profiles
user_profile : pandas Series
Aggregated user profile
L : int
Number of recommendations
Returns
-------
list
IDs of recommended items
"""
unrated_idx = df_utility[df_utility.isnull()].index
s_reco = (df_item_profiles.loc[unrated_idx]
.apply(lambda x: cosine(user_profile, x), axis=1))
d_reco = sorted(s_reco[s_reco > 0].to_dict().items(), key=lambda x: (x[1], x[0]))
return list(list(zip(*d_reco))[0][:L])
The performance of the recommender system was evaluated based on how useful the recommendations are to the brand. In this case, a recommendation is deemed useful if the more important or similar influencers appear first in the list.
A popular way of measuring this is via the discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG) [8]. These are defined as
$$ DCG = \frac{1}{m} \sum_{u=1}^m \sum_{j \in I_u, v_j \leq L} \frac{2^{r_{uj}}}{\log_2 \left(v_j +1\right)}; \qquad NDCG = \frac{DCG}{IDCG}, $$ where $L$ is the number of recommended items, $v_j \in \{1, \ldots, L\}$ is the rank of item $j$ in the recommendation list, $r_{uj}$ is the actual rating of user $u$ on item $j$, and $I_u$ is the set of items rated by user $u$. IDCG is the ideal DCG, i.e., the DCG obtained when the ordering follows the ground-truth rankings.
The values of NDCG range from 0 to 1, with 1 indicating the best performance.
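As a minimal, hypothetical sketch of these metrics (the relevance values below are invented), scikit-learn's `dcg_score` and `ndcg_score`, the same functions used later in `evaluate_recsys`, can be applied to a single simulated ranking. Note that scikit-learn's default gain is linear in $r_{uj}$ rather than the exponential $2^{r_{uj}}$ form above.

```python
import numpy as np
from sklearn.metrics import dcg_score, ndcg_score

# Hypothetical single-user example: five recommended items with invented
# ground-truth relevances (true_rel) and recommender scores (pred_rel,
# where a higher score means the item is ranked earlier in the list).
true_rel = np.array([[3, 2, 3, 0, 1]])
pred_rel = np.array([[5, 4, 3, 2, 1]])

dcg = dcg_score(true_rel, pred_rel)    # sum of relevance / log2(rank + 1)
ndcg = ndcg_score(true_rel, pred_rel)  # DCG / IDCG, bounded in [0, 1]
```

Here the recommender's ordering is nearly ideal (only items 2 and 3 are swapped relative to the ground truth), so NDCG is close to but below 1.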
def evaluate_recsys(sim_size, N, df_utility, df_item_profiles):
"""
Return the Discounted Cumulative Gain and Normalized Discounted
Cumulative Gain of a recommender system.
Parameters
----------
sim_size : int
Number of simulations or trials
N : int
Number of recommendations to be made per trial
df_utility : pandas DataFrame
Utility matrix
df_item_profiles : pandas DataFrame
Item profiles
Returns
-------
dcg : float
Discounted Cumulative Gain
ndcg : float
Normalized Discounted Cumulative Gain
"""
dcg = []
ndcg = []
np.random.seed(randstate)
for s in range(sim_size):
rated = df_utility.dropna()
test_idx = np.random.choice(rated.index, size=N, replace=False)
rated[test_idx] = np.nan
eval_item_profiles = df_item_profiles.loc[rated.index]
eval_user_profile = compute_user_profile_agg_numeric(rated, eval_item_profiles)
y_pred = recommend_agg(rated, eval_item_profiles, eval_user_profile, N)
y_true = (df_utility[y_pred].sort_values(ascending=False, kind='mergesort')
.index.tolist())
pred_rel = {y_pred[r]: len(y_pred)-r for r in range(len(y_pred))}
true_rel = {y_true[r]: len(y_true)-r for r in range(len(y_true))}
df_rel = (pd.DataFrame([pred_rel, true_rel],
index=['pred_rel', 'true_rel']).T)
dcg.append(dcg_score([df_rel.true_rel.tolist()], [df_rel.pred_rel.tolist()]))
ndcg.append(ndcg_score([df_rel.true_rel.tolist()], [df_rel.pred_rel.tolist()]))
return dcg, ndcg
The potential partners of Dove were clustered into three groups based on the results of the internal validation metrics, namely News & Entertainment Outlets, Celebrities, and Social Media Macro & Micro Influencers.
kmo = kmedoids(np.asarray(df_svd), np.arange(4), ccore=True)
kmo.process()
clusters = kmo.get_clusters()
y_kmedoid = np.zeros(len(df_svd))
for cluster, point in enumerate(clusters):
y_kmedoid[point] = cluster
df_clusters = df_for_clustering(df_, to_drop=False)
df_clusters['cluster'] = np.int64(y_kmedoid)
df_clusters.cluster = df_clusters.cluster.map({2:-1, 0:1, 1:2, 3:3})
df_clusters = df_clusters[df_clusters.cluster > 0].reset_index()
plt.figure(dpi=100, figsize=(6, 4))
plt.scatter(df_svd[:, 0], df_svd[:, 1], c=y_kmedoid)
plt.title(f'Fig. {fig_n}: Final K-medoids clustering\n' \
'retaining only 3 clusters', fontsize=14)
_ = fig_count()
plt.ylim([-5, 10])
plt.show()
Cluster 1: News & Entertainment Outlets.
This cluster of potential partners is mostly composed of news and entertainment accounts. As observed in the word cloud, their Twitter bio descriptions mostly pertain to news, with words such as `breaking`, `latest`, and `stories`. In terms of quantitative features, these potential partners have relatively high follower-to-following ratios and tweet counts, and the longest tenure on the Twitter platform.
plotly.offline.init_notebook_mode()
plot_describe_cluster(df_clusters, 1)
Cluster 2: Celebrities.
This cluster of potential partners is mostly composed of celebrity accounts. Its bio word cloud comprises keywords relating to official celebrity accounts, such as `instagram`, `singer`, and `host`. In terms of quantitative features, these potential partners have the lowest follower-to-following ratios and tweet counts, and the shortest tenure on the Twitter platform relative to the other clusters. This may mean that most of these accounts are not as active as news outlets and social media influencers.
plot_describe_cluster(df_clusters, 2)
Cluster 3: Social Media Macro & Micro Influencers.
This cluster of potential partners is mostly composed of influencers who rose to fame through social media platforms like YouTube, Instagram, and TikTok. It is worth noting that some accounts are also celebrities who may have been placed in this group due to similarities in their behavior with social media influencers. As observed in the word cloud, their Twitter bio descriptions mostly pertain to content creation and lifestyle, with words such as `youtube`, `fashion`, and `travel`. In terms of quantitative features, these potential partners have relatively high following counts and long tenure.
plot_describe_cluster(df_clusters, 3)
The recommendations made in this section assume that Dove PH aims to work with local companies and influencers only. Thus, international companies and influencers were excluded.
Note that despite the exclusion of non-recommendable partners, the order in which the recommendable partners appeared was still followed.
# Merge to follow the index of the items
df_partners = (df_partner_tweets_s[['author_id']]
.merge(df_partners, left_on='author_id', right_on='id', how='inner')
.merge(df_clusters[['id', 'cluster']], on='id', how='left')
.drop('author_id', axis=1))
df_partners.index = (range(1, len(df_partner_tweets_s)+1))
# Replace 0 ratings with null
df_partners.rating.replace({0: np.nan}, inplace=True)
# Define the utility matrix
brand_utility = df_partners.w_rating_pct
d_user_profiles = {}
d_recos = {}
L = 40
for cluster, s in df_partners.groupby('cluster'):
if cluster != -1: # outlier
c_utility = brand_utility[s.index]
c_items = df_item_profiles.loc[s.index]
c_user_profile = compute_user_profile_agg_numeric(c_utility, c_items)
d_user_profiles[cluster] = c_user_profile
d_recos[cluster] = recommend_agg(c_utility, c_items, c_user_profile, L)
If Dove wishes to partner with News/Entertainment outlets, they can consider ABS-CBN News (`ABSCBNNews`), SKY (`SKYserves`), Star Cinema (`StarCinema`), CNN (`CNN`), and SMART (`LiveSmart`) as potential partners. The defining characteristic of these influencers' content that matched with that of Dove's is the presence of words such as thank, movie, help, free, message, and engage.
# News & Entertainment companies
c1_recos = df_partners.loc[d_recos[1]]
# Assumption: Dove PH can only work with PH companies
c1_valid = [170, 128, 637, 825, 818]
c1_recos_ph = c1_recos.loc[c1_valid]
c1_recos_ph
|  | id | description | created_at | username | protected | name | url | location | followers_count | following_count | tweet_count | listed_count | included | rating | tenure | avg_tweets_per_year | w_rating | w_rating_pct | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 170 | 15872418 | Stories, video, and multimedia for Filipinos w... | 2008-08-16 10:09:33+00:00 | ABSCBNNews | 0 | ABS-CBN News | https://t.co/9yfQzguRRD | Manila, Philippines | 8917322 | 1080 | 1043998 | 8542 | 1.0 | NaN | 15 | 69599.866667 | NaN | NaN | 1.0 |
| 128 | 150165941 |  | 2010-05-31 07:28:03+00:00 | SKYserves | 0 | SKYserves | https://t.co/dEHArWO9y1 | Philippines | 134029 | 12830 | 622443 | 188 | 1.0 | NaN | 13 | 47880.230769 | NaN | NaN | 1.0 |
| 637 | 39956328 | This is the OFFICIAL Twitter account of Star C... | 2009-05-14 08:37:24+00:00 | StarCinema | 0 | Star Cinema | https://t.co/2ksfh8SqTN | Philippines | 1936568 | 588 | 403436 | 661 | 1.0 | NaN | 14 | 28816.857143 | NaN | NaN | 1.0 |
| 825 | 759251 | It’s our job to #GoThere & tell the most diffi... | 2007-02-09 00:35:02+00:00 | CNN | 0 | CNN | https://t.co/imGp4Ieixi | None | 61258353 | 1095 | 399299 | 157950 | 1.0 | NaN | 16 | 24956.187500 | NaN | NaN | 1.0 |
| 818 | 74409069 | The official Twitter account of Smart Communic... | 2009-09-15 09:38:04+00:00 | LiveSmart | 0 | SMART | https://t.co/2P0bZUPKVK | Philippines | 1525578 | 48843 | 414056 | 1420 | 1.0 | NaN | 14 | 29575.428571 | NaN | NaN | 1.0 |
c1_recos_ph_tweets = df_partner_tweets_s.loc[df_partner_tweets_s.author_id.isin(c1_recos_ph.id)]
c1_tokens = np.concatenate(c1_recos_ph_tweets.clean_text.str.split().tolist())
plot_wordcloud(c1_tokens, 'Tweets of Recommended News/Entertainment Outlets')
If Dove wishes to partner with Celebrities, they can consider Janine Gutierrez (`janinegutierrez`), Ylona Garcia (`ylona_garcia`), Xian Lim (`XianLimm`), Liza Soberano (`lizasoberano`), and Shamcey Supsup (`supsup_shamcey`) as potential partners. The defining characteristic of these influencers' content that matched with that of Dove's is the presence of words such as thank, love, happy, watching, feel, and guys.
Dove mostly partners with women to echo their campaigns; thus, Xian Lim can be considered a serendipitous result. This may be due to his recent vlog this year, 2023, which centered on "self-love." The vlog's topic may have been associated with "empowerment," one of the major values promoted by Dove.
Other results include Janine Gutierrez, a National Youth Ambassador; Shamcey Supsup, a very influential Miss Universe candidate; and Liza Soberano, who rebranded herself just recently. All of them are relevant potential partners for the brand.
# Celebrities
c2_recos = df_partners.loc[d_recos[2]]
c2_valid = [405, 443, 448, 490, 489]
c2_recos_ph = c2_recos.loc[c2_valid]
c2_recos_ph
|  | id | description | created_at | username | protected | name | url | location | followers_count | following_count | tweet_count | listed_count | included | rating | tenure | avg_tweets_per_year | w_rating | w_rating_pct | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 405 | 240611246 | film. fashion. family. Philippines ✨ @wwfphili... | 2011-01-20 09:40:33+00:00 | janinegutierrez | 0 | JANINE | https://t.co/KTIV1GjfYj | Manila | 226750 | 594 | 27249 | 87 | 1.0 | NaN | 12 | 2270.750000 | NaN | NaN | 2.0 |
| 443 | 2613171312 | /ee-lona/ • Wanderland - Mar 4 | 2014-07-09 08:34:15+00:00 | ylona_garcia | 0 | ylona. | https://t.co/LQMmKJmbvX | Los Angeles, CA | 700520 | 49 | 19569 | 97 | 1.0 | NaN | 9 | 2174.333333 | NaN | NaN | 2.0 |
| 448 | 264058981 | Artist/ Filmmaker/ Painter | 2011-03-11 07:42:11+00:00 | XianLimm | 0 | XIAN LIM | None | None | 3053233 | 423 | 22648 | 1730 | 1.0 | NaN | 12 | 1887.333333 | NaN | NaN | 2.0 |
| 490 | 284291853 | Imperfection is beauty instagram: @lizasoberan... | 2011-04-19 01:02:58+00:00 | lizasoberano | 0 | Liza Soberano | None | None | 4904453 | 173 | 14557 | 547 | 1.0 | NaN | 12 | 1213.083333 | NaN | NaN | 2.0 |
| 489 | 283783357 | No need to shout to be heard. Sometimes the be... | 2011-04-18 01:09:33+00:00 | supsup_shamcey | 0 | shamcey supsup lee | None | None | 300110 | 91 | 898 | 0 | 1.0 | NaN | 12 | 74.833333 | NaN | NaN | 2.0 |
c2_recos_ph_tweets = df_partner_tweets_s.loc[df_partner_tweets_s.author_id.isin(c2_recos_ph.id)]
c2_tokens = np.concatenate(c2_recos_ph_tweets.clean_text.str.split().tolist())
plot_wordcloud(c2_tokens, 'Tweets of Recommended Celebrities')
If Dove wishes to partner with Social Media Micro- & Macro-Influencers, they can consider Tirso Cruz III (`tirsocruziii`), Dolly Carvajal (`dollyannec`), Sam Pinto (`SamPinto_`), Mikael Daez (`mikaeldaez`), and Paula Taylor (`paulataylor`) as potential partners. The defining characteristic of these influencers' content that matched with that of Dove's is the presence of words such as photo, posted, boutique, resort, swipe, and get.
For this cluster, a serendipitous influencer might be Tirso Cruz III. It is interesting to note, however, that Tirso is a cancer survivor and has since been a cancer-awareness advocate. This form of empowerment might have led to his inclusion in the recommendations for this cluster. Other potential candidates include Dolly Carvajal, an entertainment columnist, and Sam Pinto, a newlywed and a new mom.
# Social Media influencers
c3_recos = df_partners.loc[d_recos[3]]
# Choose the top 5 only
c3_recos_ph = c3_recos.head(5)
c3_recos_ph
|  | id | description | created_at | username | protected | name | url | location | followers_count | following_count | tweet_count | listed_count | included | rating | tenure | avg_tweets_per_year | w_rating | w_rating_pct | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 196 | 168221275 |  | 2010-07-18 19:03:13+00:00 | tirsocruziii | 0 | Tirso Cruz III | None | None | 125367 | 270 | 10450 | 106 | 1.0 | NaN | 13 | 803.846154 | NaN | NaN | 3.0 |
| 66 | 129144705 | Vodka Queen, MJ fanatic, proud single mom, hop... | 2010-04-03 09:20:27+00:00 | dollyannec | 0 | Dolly Anne Carvajal | None | on the edge:-) | 53964 | 820 | 24268 | 132 | 1.0 | NaN | 13 | 1866.769231 | NaN | NaN | 3.0 |
| 726 | 54866152 | Facebook Page - hellosampinto • Instagram and ... | 2009-07-08 11:15:31+00:00 | SamPinto_ | 0 | Sam Pinto | https://t.co/boI6v2rK9V | Republic of the Philippines | 1316879 | 313 | 30353 | 1732 | 1.0 | NaN | 14 | 2168.071429 | NaN | NaN | 3.0 |
| 674 | 46081647 | I guess i had a twitter account all along ;) h... | 2009-06-10 10:24:56+00:00 | mikaeldaez | 0 | Mikael Daez | http://t.co/u3QCvSb4L3 | Philippines | 204286 | 490 | 18905 | 95 | 1.0 | NaN | 14 | 1350.357143 | NaN | NaN | 3.0 |
| 666 | 44102517 |  | 2009-06-02 11:35:40+00:00 | paulataylor | 0 | Paula Taylor | http://t.co/86UQVtBi3M | None | 592520 | 99 | 8145 | 1356 | 1.0 | NaN | 14 | 581.785714 | NaN | NaN | 3.0 |
c3_recos_ph_tweets = df_partner_tweets_s.loc[df_partner_tweets_s.author_id.isin(c3_recos_ph.id)]
c3_tokens = np.concatenate(c3_recos_ph_tweets.clean_text.str.split().tolist())
plot_wordcloud(c3_tokens, 'Tweets of Recommended Social Media\nMicro- & Macro-Influencers')
From 50 trials with 5 recommendations each, the average Discounted Cumulative Gain (DCG) and Normalized Discounted Cumulative Gain (NDCG) are 9.28 and 0.904, respectively.
Given that the average NDCG is close to 1, we can conclude that the recommendations made by our system are relevant to and personalized for the brand.
# Evaluate the recommender system
sim_size = 50
N = 5
test_dcg, test_ndcg = evaluate_recsys(sim_size, N, brand_utility, df_item_profiles)
print(f'Avg. DCG: {np.mean(test_dcg):.4f} -- Avg. NDCG: {np.mean(test_ndcg):.4f}')
# Plot the evaluation results
plt.figure(figsize=(8, 4), dpi=100)
plt.plot(range(1, sim_size+1), test_ndcg, marker='o', label='NDCG', color='#003B7F')
plt.xlabel('Trial')
plt.ylabel('Normalized Discounted\nCumulative Gain (NDCG)')
plt.ylim(0.65, 1.05)
plt.title(f'Fig. {fig_n}: Recommender System Performance', fontsize=15)
_ = fig_count()
# plt.savefig(f'eval_ndcg.png', dpi=150, bbox_inches='tight');
Avg. DCG: 9.2845 -- Avg. NDCG: 0.9039
The work focused on 3 social networks only: Dove's, Anne Curtis', and Alden Richards'. This introduced a constraint on the potential influencers that the model could choose from.
Given the absence of explicit ratings, the user ratings were assumed to be based on followers and activity of an influencer. This was arbitrarily chosen based on the use case.
Only 20 tweets per influencer were considered when creating the item profiles. These were assumed to be representative of the tone, values, or ideals of an influencer that could be matched with the brand's.
Given these limitations, future improvements may include (a) expanding the data on influencers to get a good diverse sample, (b) firming up the text pre-processing pipeline to get better clustering and recommendations, (c) identifying the keywords that best describe an influencer through TF-IDF or Topic Modeling instead of randomly selecting 20 tweets, and (d) incorporating an explainability algorithm into the pipeline to better guide brands on what features dictate the result of the matching process the most.
Influencers can be categorized into three groups: (1) News and Entertainment outlets, (2) Celebrities, and (3) Social Media Micro- and Macro-influencers. Different brands and companies will have varying preferences on who to collaborate with, hinged on a few crucial factors: budget, intended reach or scope of influence, and whether or not the influencer's values match theirs.
Depending on the goal, our Content-Based Recommendation system, with an average relevance score of 0.90, can guide a brand in choosing the right influencer to partner with in their marketing campaigns. An efficient recommender system can afford brands several advantages.
An influencer whose values, audience, and interests align with the brand's target market ensures efforts are targeted and the campaign's message reaches the right people. Influencer marketing can also be more cost-effective than traditional advertising. High agency fees and commissions can be a limiting factor, which makes a recommender system all the more valuable: it allows brands to identify influencers with a suitable audience size and engagement rate that fit their budget.
Depending on its goals for using the system, a business can save man-hours that would otherwise be spent scouring social media sites for potential partners. With this system, the business is given a shortlist of where to start that is more or less in line with its values.
Conversely, this system can also be used by influencers to identify potential business partners they can reach out to. This may give them business opportunities that they would not otherwise have gotten.
[1] De Jesus, S. (2016, December 5). Top 10 most followed Twitter accounts in PH for 2016. RAPPLER. https://www.rappler.com/technology/social-media/154622-top-10-most-followed-twitter-accounts-ph-2016/
[2] Alis, C. (2022). Information Retrieval and Searching by Similarity Part I.
[3] GeeksforGeeks. (2022, November 7). Python Lemmatization Approaches with Examples. https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/
[4] Srinidhi, S. (2021, December 13). Lemmatization in Natural Language Processing (NLP) and Machine Learning. Medium. https://towardsdatascience.com/lemmatization-in-natural-language-processing-nlp-and-machine-learning-a4416f69a7b6
[5] Bernal, D. E., Bundoc, S. F., Escalante, S. O., Guinto, J. A., Mann, J. D., Menorca, M. L. (2022). PLDT Anuna?: Topic Modeling of Customer Concerns to Streamline Service Support.
[6] GeeksforGeeks. (2023, January 19). Understanding TF IDF Term Frequency Inverse Document Frequency. https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
[7] Góralewicz, B. (2023, January 30). The TF*IDF Algorithm Explained. Onely. https://www.onely.com/blog/what-is-tf-idf/
[8] Alis, C. (2022). Content-based Recommender Systems.
[9] Nguyen, K. (2021, July 26). The Social Media Revolution: How Social Media Has Changed Marketing. BUSINESSNAV. https://businessnav.com/the-social-media-revolution-how-social-media-has-changed-marketing/
[10] Pec, T. (2022, September 6). Why Businesses And Brands Need To Be Taking Advantage Of Social Media. Forbes. https://www.forbes.com/sites/forbesagencycouncil/2022/09/06/why-businesses-and-brands-need-to-be-taking-advantage-of-social-media/?sh=255024e6216c
[11] Geyser, W. (2023, January 20). What is Influencer Marketing? – The Ultimate Guide for 2023. Influencer Marketing Hub. https://influencermarketinghub.com/influencer-marketing/#toc-0